
Set per-device kernel shared memory on every GPU #806

Open: tvogels wants to merge 1 commit into pyscf:master from tvogels:fix-initialization


tvogels commented Jun 26, 2026


On a machine with more than one GPU, direct-SCF, gradient, and Hessian builds crash for any basis containing d or f shells, with errors such as `CUDA Error in MD_build_j: invalid argument` and `RuntimeError: MD_build_j kernel for (dp|dp) failed`. Single-GPU runs always work, and p-only bases work even on multiple GPUs.

The cause: CUDA kernels that need more than 48 KB of dynamic shared memory require a per-device opt-in via `cudaFuncSetAttribute`. The Python `*_init` wrappers were called only once, at module import, on device 0, whereas the work is then distributed across all GPUs through `multi_gpu.run`. Any kernel launched on device 1 or higher therefore never received the opt-in and failed.
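To make the failure mode concrete, here is a toy model in plain Python. It makes no CUDA calls: `FakeDevice`, `opt_in`, `launch`, and the 96 KB request are hypothetical stand-ins, with `opt_in` playing the role of `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize`. It is only a sketch of the per-device state described above, not code from this repository.

```python
# Toy model only: no GPU is touched and every name here is hypothetical.
DEFAULT_LIMIT = 48 * 1024  # dynamic shared memory allowed without an opt-in


class FakeDevice:
    """Stands in for one GPU and tracks one kernel's shared-memory limit."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.max_dynamic_shm = DEFAULT_LIMIT

    def opt_in(self, nbytes):
        # Plays the role of the per-device cudaFuncSetAttribute call.
        self.max_dynamic_shm = nbytes

    def launch(self, shm_bytes):
        # A launch that requests more than this device's limit is rejected.
        if shm_bytes > self.max_dynamic_shm:
            return "invalid argument"
        return "ok"


devices = [FakeDevice(i) for i in range(4)]
SHM_SIZE = 96 * 1024  # assumed request above the 48 KB default

# Module-level init: executed once, while device 0 is the current device.
devices[0].opt_in(SHM_SIZE)

# Large-shared-memory kernels succeed only where the opt-in happened.
large = [d.launch(SHM_SIZE) for d in devices]

# Kernels that fit in the default limit succeed everywhere.
small = [d.launch(16 * 1024) for d in devices]
```

In this model the large request succeeds only on device 0, while the small request succeeds on all four devices, which matches the observation that p-only bases work on multiple GPUs.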

This was introduced in v1.5.0 and is still present in the current v1.7.4. Commit 6af29a4 (#547) moved `init_mdj_constant` out of the per-device `proc` to module level, and a396d48 (#505) added the `RYS_build_jk`/`RYS_build_k` builders with module-level init from the start. The later gradient, Hessian, and periodic kernels followed the same broken pattern.

CI does not cover this condition because the runners have only a single GPU, so the tests pass there and the regression went unnoticed.

The fix restores the original pattern: each `*_init(SHM_SIZE)` is called inside the per-device `proc` with an error check, and the module-level call is demoted to a `.restype` declaration. This touches ten `proc` sites in six files (`scf/j_engine.py`, `scf/jk.py`, `grad/rhf.py`, `hessian/rhf.py`, `pbc/scf/j_engine.py`, `pbc/scf/rsjk.py`) and mirrors the rysj path, which was always correct. Single-GPU behaviour is unchanged.
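The shape of the fix can be sketched with the same kind of toy model. Everything below is a hypothetical stand-in rather than the actual diff: `fake_init` mimics a C `*_init` wrapper that returns 0 on success, `fake_launch` mimics a kernel launch, and `run_on_all_devices` is a simplified serial substitute for a multi-GPU dispatcher such as `multi_gpu.run`.

```python
# Hypothetical stand-ins; this is not the gpu4pyscf API.
DEFAULT_LIMIT = 48 * 1024
SHM_SIZE = 96 * 1024  # assumed request above the 48 KB default

# Per-device shared-memory limit for one kernel, keyed by device id.
shm_limit = {dev: DEFAULT_LIMIT for dev in range(4)}


def fake_init(device_id, nbytes):
    """Mimic a C *_init wrapper: raise the limit and return 0 on success."""
    shm_limit[device_id] = nbytes
    return 0


def fake_launch(device_id, nbytes):
    """Return 0 if the request fits this device's limit, nonzero otherwise."""
    return 0 if nbytes <= shm_limit[device_id] else 1


def proc(device_id):
    # Opt in on the device that will run the kernel, and check the result.
    err = fake_init(device_id, SHM_SIZE)
    if err != 0:
        raise RuntimeError(f"kernel init failed on device {device_id}")
    err = fake_launch(device_id, SHM_SIZE)
    if err != 0:
        raise RuntimeError(f"kernel launch failed on device {device_id}")
    return device_id


def run_on_all_devices(func, n_devices=4):
    """Serial stand-in for a dispatcher that runs func once per device."""
    return [func(dev) for dev in range(n_devices)]


finished = run_on_all_devices(proc)
```

Because the opt-in now runs inside `proc`, every device raises its own limit before launching, so the model completes on all four devices. Repeating the init on device 0 is harmless, which is consistent with single-GPU behaviour staying the same.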

I verified it on four A100s: the SCF, gradient, Hessian, and periodic test suites that failed before now pass (j_engine 6, jk 12, grad 21, hessian 7, PBC j_engine+jk 36, PBC scf/stress 27), and single-GPU runs are unchanged.

This fixes #623.

Development

Successfully merging this pull request may close these issues.

Direct SCF fails with multiple GPUs

1 participant